Stable Diffusion

GitHub - CompVis/stable-diffusion

https://stability.ai/blog/stable-diffusion-announcement

latent text-to-image diffusion model

Latent diffusion model

Trained on LAION-Aesthetics, a subset of the LAION-5B database

Trained on "Ezra-1 AI", a cluster of 4,000 NVIDIA A100 GPUs

I searched for Ezra-1 but could not find out what it is 基素.icon

Developed by Stability AI

Stability AI is building open AI tools that will let us reach our potential.

https://stability.ai/

It seems Emad is the one raising the money

The model itself was developed by CompVis and Runway

The model itself builds upon the work of the team at CompVis and Runway in their widely used latent diffusion model combined with insights from the conditional diffusion models by our lead generative AI developer Katherine Crowson, DALL·E 2 by Open AI, Imagen by Google Brain and many others.

The model is released free of charge

Many derivative models have appeared https://rentry.org/sdmodels

Stable Diffusion launch announcement — Stability.Ai

Generates 512x512 images with less than 10 GB of VRAM

Democratizes image generation

10,000 beta testers, creating 1.7 million images a day

The research origins of Stable Diffusion | Runway Research

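Image-generation AI "Stable Diffusion": an easy-to-follow illustrated explanation of "how it draws pictures," worth knowing in order to make the most of it - GIGAZINE

A somewhat in-depth explanation for general readers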

/nishio/Stable Diffusion勉強会

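Explanation of how it works

@birdMan710Nika: A list of things worth reading if you want to understand stable diffusion

Students who are starting to study this now have it tough

Let me know if anything is missing or mistyped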

https://pbs.twimg.com/media/FeN-ThcVUAAkdpF.png

CNNs and the like are assumed knowledge

U-Net

DDPM

Improved Denoising Diffusion Probabilistic Models（2021）

DDIM

PNDM

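Diffusion Models Beat GANs on Image Synthesis proposes that diffusion models do better than GANs

Uses classifier guidance

Classifier-Free Diffusion Guidance proposes that, even without a classifier, sample quality/diversity equivalent to having one can be achieved

@birdMan710Nika: Just my impression, but the classifier-free diffusion guidance formula in particular helped me understand why stable diffusion takes an "empty prompt" and what scale means

Alongside this there is the lineage Attention mechanism → CLIP

Latent diffusion model: GLIDE (2021) (a diffusion model) made lighter so that it runs even on an ordinary gaming PC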
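
The classifier-free guidance formula mentioned in the list above can be sketched in a few lines. This is a minimal illustration, assuming the form commonly used in Stable Diffusion sampling code: the noise predicted for the empty prompt is pushed toward the noise predicted for the text prompt by a factor `scale`. The function name `guided_noise` and the two-element vectors are made up for illustration; in a real sampler both predictions come from the U-Net.

```python
def guided_noise(eps_uncond, eps_cond, scale):
    """Extrapolate from the empty-prompt prediction toward the prompt prediction."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy stand-ins for the two noise predictions (assumed values, not real model output).
eps_uncond = [0.0, 1.0]  # prediction conditioned on the empty prompt
eps_cond = [1.0, 3.0]    # prediction conditioned on the text prompt

ignore_prompt = guided_noise(eps_uncond, eps_cond, 0.0)  # scale 0: prompt has no effect
plain_cond = guided_noise(eps_uncond, eps_cond, 1.0)     # scale 1: ordinary conditional prediction
amplified = guided_noise(eps_uncond, eps_cond, 7.5)      # scale > 1: prompt direction is amplified

print(ignore_prompt, plain_cond, amplified)
```

This is why an empty prompt is always evaluated: it supplies the baseline from which the prompt's contribution is measured, and scale decides how strongly that contribution is exaggerated.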

A paper-reading guide for people who want to understand Stable Diffusion from the basics [free article]

Fundamentals

U-Net (Ronneberger et al., 2015)

Vision Transformer; ViT; Dosovitskiy et al., 2020

CLIP (Radford et al., 2021)

Foundations of diffusion models

NCSN (noise conditional score networks; Song and Ermon, 2019)

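Not in birdMan's chart

Before tackling diffusion models, one concept that is useful to know is "score-based generative modeling (Song and Ermon, 2019)"

The fundamental problem of generative models

Individual data items such as images are represented as points in a multidimensional space. To generate high-quality images, it is thought to work well to generate new points from the "region where many points seem to gather."

I tend to skim past this with a vague feeling that "sparse regions probably carry no information," but I do not actually understand what it is saying 基素.icon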

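According to what I was told at /villagepump/2022/10/24

Every 512×512 RGB image can be regarded as a single point in a 512×512×3-dimensional space

However, almost any point picked at random from this space is an image that is "meaningless to humans," such as "pure black" or "looks like mere noise"

Conversely, "pictures that are meaningful to humans" are densely packed into a narrow region of this space

Put another way: they follow some distribution that is not uniform

It is difficult for a human to specify this "narrow region" with rules, but with machine learning we can supply data as concrete examples and represent "their neighborhood"

Put another way: the distribution can be learned from the training data

If we can pick from that region (sample from the distribution), we can generate new images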
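
The counting argument above can be made concrete with a small sketch. It only illustrates the dimension count and what a uniformly random point looks like; the variable names and the "neighbor difference" statistic are my own choices, not something from the cited discussion.

```python
import random

WIDTH, HEIGHT, CHANNELS = 512, 512, 3
dimensions = WIDTH * HEIGHT * CHANNELS  # one coordinate per pixel channel

# Pick one point uniformly at random from the space of 8-bit RGB images.
rng = random.Random(0)
random_image = [rng.randrange(256) for _ in range(dimensions)]

# In a uniform sample, adjacent entries of the flattened vector are unrelated,
# so their average absolute difference is large (about 85 for values 0..255).
# Natural images tend to be much smoother than this.
neighbor_diffs = [abs(a - b) for a, b in zip(random_image, random_image[1:])]
mean_abs_diff = sum(neighbor_diffs) / len(neighbor_diffs)

print(dimensions, round(mean_abs_diff, 1))
```

A point drawn this way is overwhelmingly likely to look like static, which is the sense in which meaningful images occupy only a narrow region of the space.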

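How should we estimate the collection (distribution) of real-world data from the data, and how should we sample new points from it?

The score is the direction in which most of the points are concentrated (formally, the gradient of the log density, ∇x log p(x))

As we will see later, this direction of "which way to move to get closer to a realistic image" can be estimated by a neural network, and it is known that diffusion models, which generate data by progressively removing noise, and this score-matching-based generation are essentially equivalent apart from minor differences such as coefficients in the formulas.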

https://www.youtube.com/watch?v=8TcNXi3A5DI

A talk by Stefano Ermon, one of the authors

In score-based methods, data is generated by starting from arbitrary noise-like data and gradually transforming it.
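
As a rough sketch of "start from noise and gradually move along the score," the following runs Langevin dynamics on a toy problem. It assumes the data distribution is a 1D Gaussian whose score is known in closed form; in the actual method the score is estimated by a neural network, and `MU`, `SIGMA`, the step size, and the step count are all illustrative choices.

```python
import math
import random

MU, SIGMA = 3.0, 0.5  # toy "data distribution": a 1D Gaussian (assumption)


def score(x):
    """Gradient of log N(x; MU, SIGMA^2) with respect to x."""
    return -(x - MU) / SIGMA**2


def langevin_sample(rng, n_steps=500, step=0.01):
    """Start from an arbitrary point and follow the score plus injected noise."""
    x = rng.uniform(-10.0, 10.0)
    for _ in range(n_steps):
        x += 0.5 * step * score(x) + math.sqrt(step) * rng.gauss(0.0, 1.0)
    return x


rng = random.Random(0)
samples = [langevin_sample(rng) for _ in range(500)]
sample_mean = sum(samples) / len(samples)
sample_std = math.sqrt(sum((s - sample_mean) ** 2 for s in samples) / len(samples))

print(sample_mean, sample_std)
```

The printed mean and standard deviation should land close to `MU` and `SIGMA`: the samples end up distributed like the data even though each one started from an unrelated point.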

DDPM (denoising diffusion probabilistic models; Ho et al., 2020)

Developments in diffusion models

DDIM (denoising diffusion implicit models; Song et al., 2020)

Improved DDPM (Nichol and Dhariwal, 2021)

ADM (ablated diffusion model; Dhariwal and Nichol, 2021)

This is the Diffusion Models Beat GANs on Image Synthesis paper

Up to this point, image generation is unconditional or class-conditional

We want image generation from free-form text! → GLIDE (2021) (Nichol et al., 2021)

Stable Diffusion

LDM (Latent diffusion model, Rombach et al., 2021)

[CEO interview] THE GUILD's Fukatsu asks the developer of the image-generation AI "Stable Diffusion" about the "new era" of AI business | DIAMOND SIGNAL

founder https://twitter.com/EMostaque

There is an official Discord

https://discord.com/invite/stablediffusion

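They are recruiting for the wave 2 beta

I joined

Apparently a 2D (anime) version is coming soon

@EMostaque: Stable Diffusion anime version coming soon! 🦾

Apparently the founder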

https://huggingface.co/stabilite

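The eve of world transformation is quieter than expected | Takayuki Fukatsu (fladdict) | note

As far as I could observe, this is the article that went viral in Japan

Reportedly outperforms Midjourney

Midjourney reportedly incorporated Stable Diffusion afterwards (citation needed), so I cannot say. AI advances day by day 基素.icon

Released publicly on August 22, 2022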

https://stability.ai/blog/stable-diffusion-public-release

Creative ML OpenRAIL-M license

Commercial use is permitted

Apparently a license that focuses on ethical and legal use

https://huggingface.co/CompVis/stable-diffusion

model

The recommended model weights are v1.4 470k, a few extra training steps from the v1.3 440k model made available to researchers. The final memory usage on release of the model should be 6.9 Gb of VRAM.

Just barely runs even on a GTX 1070

In the coming period we will release optimized versions of this model along with other variants and architectures with improved performance and quality. We will also release optimisations to allow this to work on AMD, Macbook M1/M2 and other chipsets. Currently NVIDIA chips are recommended.

User reviews

https://twitter.com/kawai_nae/status/1561843126842851328?s=21&t=jlDwPxrQ-eJ6RFXDHO5QWg

Running on Colab

https://note.com/npaka/n/ndd549d2ce556

prompt guide https://beta.dreamstudio.ai/prompt-guide

Runes (prompts) for working with image-generation AI

Specify only the subject

This is usually not good. It turns into chaos.

Specify a style

Specify an artist

made by Pablo Picasso

Make the style more specific so that the results become consistent

Finishing touches

For instance, if you want to make your image more artistic, add “trending on artstation”. if you want to add more realistic lighting add “Unreal Engine.”
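
The layering described in the prompt guide notes above (subject, then style, then artist, then finishing touches) can be sketched as a small helper. `build_prompt` is a hypothetical function of my own, and the subject and style strings are made-up examples; only "made by Pablo Picasso" and "trending on artstation" come from the guide.

```python
def build_prompt(subject, style=None, artist=None, finishing=()):
    """Join the prompt layers in order: subject, style, artist, finishing touches."""
    parts = [subject]
    if style:
        parts.append(style)
    if artist:
        parts.append(f"made by {artist}")
    parts.extend(finishing)
    return ", ".join(parts)


# Subject only: per the guide, this usually gives chaotic results.
bare_prompt = build_prompt("a castle on a hill")

# Subject plus style, artist, and a finishing touch.
full_prompt = build_prompt(
    "a castle on a hill",
    style="oil painting",
    artist="Pablo Picasso",
    finishing=["trending on artstation"],
)

print(bare_prompt)
print(full_prompt)
```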

Things it is bad at

https://note.com/shi3zblog/n/nc9a0d759abf7

"An apple-shaped tank" works in Midjourney, but Dream Studio cannot do it at all

Training Stable Diffusion reportedly cost as much as 200 million yen even on fairly cheap compute

https://note.com/shi3zblog/n/n1b2b0e32bd60

The developer says $600k

@EMostaque: @KennethCassel We actually used 256 A100s for this per the model card, 150k hours in total so at market price $600k

Stable Diffusion#63029c14774b17000013823c
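
The figures in the tweet above can be unpacked with a little arithmetic. The sketch below uses only the three numbers quoted (256 A100s, 150k hours, $600k); reading "150k hours" as GPU-hours and assuming all 256 GPUs ran for the whole period are my assumptions.

```python
num_gpus = 256
gpu_hours = 150_000       # read as total GPU-hours (assumption)
total_cost_usd = 600_000

# Implied market price per GPU-hour.
usd_per_gpu_hour = total_cost_usd / gpu_hours

# Wall-clock duration if all 256 GPUs ran continuously (assumption).
wall_clock_hours = gpu_hours / num_gpus
wall_clock_days = wall_clock_hours / 24

print(usd_per_gpu_hour, wall_clock_hours, round(wall_clock_days, 1))
```

Under those assumptions the quoted price works out to $4 per GPU-hour and a little over three weeks of training.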

Practice

Playing with Dream Studio

NSFW replacement code

https://github.com/CompVis/stable-diffusion/blob/69ae4b35e0a0f6ee1af8bb9a5d0016ccb27e36dc/scripts/txt2img.py#L88-L94

An invisible watermark makes it possible to tell whether an image was machine-generated

https://github.com/ShieldMnt/invisible-watermark

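@imos: The stable diffusion model is only about 1 GB, and if it tried to memorize images outright it could only hold a few thousand,

1024^3/(24*512*512/8) ≈ 1365 images

so I find it moving that it can generate such rich output. The 1 GB size is largely a hardware constraint and is honestly very tight (100 MB would probably not be enough), which raises hopes for what advances in computing will make possible.

hardmaru (David Ha) joined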
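
The back-of-the-envelope estimate above can be checked directly. The sketch takes the tweet's 1 GB figure as given and counts how many uncompressed 512×512 images at 24 bits per pixel would fit in that many bytes; the variable names are mine.

```python
model_bytes = 1024 ** 3                         # 1 GB, as stated in the tweet
bits_per_pixel = 24                             # 8 bits each for R, G, B
image_bytes = bits_per_pixel * 512 * 512 // 8   # one uncompressed 512x512 image

images_that_fit = model_bytes / image_bytes

print(image_bytes, int(images_that_fit))
```

Each raw image takes 786,432 bytes, so roughly 1,365 of them would fill 1 GB, which matches the note's figure.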

@hardmaru: Personal update: I joined @StabilityAI as head of strategy!

I can see the creative energy unleashed when people collectively gain control of new transformative technologies like large generative models. I want to create a future where open-source 'foundation' models is the norm.

https://pbs.twimg.com/media/Fer6t2RaMAAM6NH.png